Social networks have become an essential part of our lives today, at least in
their virtual dimension, and the web is almost impossible to imagine
without this pervasive phenomenon. Networks such as Twitter and Facebook are
important components of the information infrastructure. In the analysis of
social networks, one of the important issues is community detection. A
community is a group of network nodes whose connections with each other are
denser than their connections with the other nodes of the network.
Various methods have been proposed for community detection. One of the
existing methods is based on data stream clustering. The output data of a social
network can be modeled as a data stream, and fast and accurate clustering of
this data stream can be very effective in detecting communities. In this
research, communities are detected using a fast and accurate online clustering
algorithm. The simulation results indicate that the proposed method
can determine the number of clusters optimally and performs better
than similar methods. The proposed algorithm can also be used in many other
applications.
1. INTRODUCTION


Today, the number of internet connections and users is increasing exponentially. The fast spread of
internet, digital, and satellite technologies has made real-time communication between large parts of the world
possible. Consequently, national information controls in several countries have become ineffective.
Nowadays, the role of the media and their influence on the political construction of the world is no secret to
anybody. Communication theorists believe that the world is in the hands of those who own the media. The key
role of the media in shaping public opinion is the reason they are given this much importance.


Today, social networks are the rudder of the unsettled sea of the internet. These networks, which have a
virtual social orientation and are based on technology, play an essential part in the media equations of the world.
They offer many opportunities in the internet space, including reading, sharing,
and searching news, uploading videos and photos, writing posts, joining different groups, and
political mobility. This has caused internet users to favor social networks. Social networks are generally
composed of individuals or organizations that are connected through one or more sorts of dependencies.
In the context of a complex information society, these networks show the effective functioning of the convergent
network and its popularity and success. Their increasing number is because of their social color [1]-[3].

Social networks, or in other words social media, are built on data related to humans that are
usually produced by them and often include their social characteristics. Social network analysis, which is
sometimes abbreviated as SNA and sometimes referred to as the analysis of complex dynamic networks, is the
process of examining and evaluating the structure of a graph of human interactions whose nodes are connected
to each other by communication lines. In fact, social network analysis is the process of researching and
examining social structures using network theory and graph theory [4], [5].

Graphs, as mathematical structures that show the relationships between objects at a suitable level of
abstraction, have been widely used in modeling various problems. For this reason, suitable tools for their
analysis have become a necessity, and many researchers in various fields have provided methods for this work.
One of the important analyses performed on graphs is the clustering of graph nodes. The clustering of nodes
in a graph is in fact the same as the problem of recognizing graph communities, provided that the graph
nodes correspond to the data points and the shortest-path distance between two nodes is taken as the distance
between the corresponding points. One of the most practical problems in graph-based social network analysis
is the data clustering problem [6].
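
The correspondence between graph nodes and data points can be illustrated with a small sketch. The Python code below (used for illustration only; the toy graph and the function name are assumptions, not taken from the paper) computes shortest-path distances from one node with a breadth-first search, assuming an unweighted, undirected graph stored as an adjacency list. These hop counts could then serve as the distances d(x_i, x_j) used by a clustering method.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Return the hop count from source to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Hypothetical toy network: two triangles joined by the single edge c-d.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}

distances_from_a = shortest_path_lengths(toy_graph, "a")
print(distances_from_a)
```

In this toy network, nodes inside the same triangle are one hop apart, while nodes in different triangles are two or three hops apart, which is the kind of separation a distance-based clustering can exploit.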

The purpose of clustering is to extract parts of the data that are very similar. In other words, in
clustering, we split the data into groups in such a way that the data in each group have similar or identical
characteristics, and the data from separate groups have different characteristics. For example, if we take into
account people’s political characteristics (regardless of how they are socially connected), the clustering results
will ideally divide people into groups whose members have similar political leanings, while people from different
groups have less similarity in political attitude. Clustering methods are used in various applications,
including data simplification, data analysis, data similarity measurement, and finding patterns. The problem of
clustering in graphs is also called community detection. As shown in Figures 1(a) and (b), communities are
groups of network nodes that are closely related to each other and have relatively little communication with
nodes outside the group. Community detection and graph clustering in a social network help to simplify and
better analyze it [7].




Figure 1. The detection uses users’ similarity in online activities (topology) and their profiles (attributes) for: 
(a) a graph where nodes represent users in a social network and (b) two communities based on the prediction 
of users’ professions [5] 
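
The definition of a community given above (many connections inside the group, few to the outside) can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the edge list is a hypothetical toy network, and `edge_counts` is a helper name introduced here, not part of any cited method.

```python
def edge_counts(edges, community):
    """Count edges inside a community and edges crossing its boundary."""
    members = set(community)
    internal = sum(1 for u, v in edges if u in members and v in members)
    boundary = sum(1 for u, v in edges if (u in members) != (v in members))
    return internal, boundary


# Hypothetical undirected network: two triangles joined by one edge.
edges = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
]

internal, boundary = edge_counts(edges, {"a", "b", "c"})
print(internal, boundary)
```

A group such as {a, b, c} has more internal than boundary edges and therefore behaves like a community, whereas an arbitrary group such as {a, d} has no internal edges at all.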


There are many problems with community detection. One of these problems is the large amount of
data: the data generated from social networks is huge. In addition, community detection is an NP-hard
problem and has not yet been solved to a satisfactory level [7], [8]. Therefore, recent research has
focused on developing algorithms that have low computational overhead and high scalability.
Our contribution in this research is to provide a fast and accurate algorithm for the detection of communities
in social networks with large data volumes. The proposed algorithm is based on fast data stream clustering. In
order to increase the speed of the proposed algorithm, the crow search optimization algorithm [9] has been
used. Such a design helps to create an intelligent algorithm with a low computational load. The rest of this
study is organized as follows. In the second section, related works are reviewed. In the third section, the
proposed method is presented. This method is based on the optimization of clustering parameters to create a
fast algorithm. In the fourth section, the simulation results are presented. The fifth section concludes the study.
2. RELATED WORKS

Many researchers have been attracted to the stream clustering field in recent years. Thus, many
algorithms have been proposed, for example CluStream [10], STREAM [11], D-Stream [12], and DenStream [13]. Some
algorithms “transfer” conventional clustering algorithms to the data stream scenario; for example, the
STREAM algorithm [11] adopts a version of k-means for data streams. On the other hand, some algorithms, such
as CluStream [10], propose methods that are designed specifically for data streams. Many surveys exist in the
related literature due to the interest in this field. Early research by Guha et al. [14] presented an overview of the
field and its challenges. There are also more recent surveys, such as those of Aggarwal [15] and
Silva et al. [16]. A survey by Amini et al. [17] concentrates on density- and grid-based stream clustering
algorithms. Gu and Angelov [18] present AD_clustering, a novel autonomous data-driven clustering
method for processing live data streams. This method is entirely based on the data samples and
their ensemble properties and is fully unsupervised; there is no need for problem-specific or
user-predefined parameters and assumptions, which most current clustering methods require.
Some deep learning methods for community detection can be found in [19]-[21]. In short, due to the
importance of clustering in data analysis and the extensive use of graphs in problem modeling, the clustering
of graphs has received particular attention from researchers, and various methods have been presented for it. Since
researchers from different disciplines have worked on this issue, there are different approaches to it. However, the
common idea of most of these methods is to find subgraphs that have many internal connections and few external
connections. In this research, a fast and accurate algorithm for the detection of communities in social networks
with large data volumes is presented. This method is an improvement on the algorithm of [18], obtained by
intelligently choosing its parameters for big data.


3. PROPOSED METHOD IN COMMUNITY DETECTION 

The proposed method of this research is suitable for big data. In the community detection problem,
the aim is to put nodes that are strongly connected to each other in one community and nodes that are weakly
connected to each other in separate communities. Graph representation algorithms, similar to community
detection methods, try to provide a two-dimensional or three-dimensional image of the graph in which
nodes that are connected to each other are located in the same region and nodes that are not connected to
each other are located in distant regions. We assume that the data received from the social network is modeled
in the form of the data stream (1).


$$\text{Data Stream} = \{x_1, x_2, x_3, x_4, x_5, \ldots, x_k, \ldots\} \qquad (1)$$


The distance between two data points, for example x_1 and x_2, is denoted d(x_1, x_2). With the Euclidean
distance, the shortest distance between two points is calculated according to the Pythagorean relation. If the
absolute values of the differences between the components of the points are used instead of the squares of these
differences, the distance function is called the Manhattan distance. A more general form of the Euclidean and
Manhattan distances is the Minkowski distance. A suitable distance should be chosen in each case according to
the nature of the multivariate data. If the data are continuous and follow a multivariate
distribution, the Euclidean distance gives the best results, but if there are outliers, the Manhattan distance
is more suitable for quantitative data. One of the appropriate methods for clustering the data stream (1) is
presented in [18] (Figure 2). The input of this algorithm is the data stream (1), and its output is a set of
clusters, which we model with relation (2).


$$\{c_1, c_2, c_3, c_4, c_5, \ldots, c_n\} \qquad (2)$$
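
The three distance functions discussed above are related through the Minkowski distance of order p, d_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p), which gives the Manhattan distance for p = 1 and the Euclidean distance for p = 2. A minimal Python sketch follows; the function names and the sample points are chosen here for illustration and do not come from the paper or from [18].

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of components")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def manhattan_distance(x, y):
    """Special case p = 1: sum of absolute component differences."""
    return minkowski_distance(x, y, 1)


def euclidean_distance(x, y):
    """Special case p = 2: the Pythagorean relation."""
    return minkowski_distance(x, y, 2)


# Illustrative sample points (not data from the paper).
x1 = (0.0, 0.0)
x2 = (3.0, 4.0)
print(manhattan_distance(x1, x2))
print(euclidean_distance(x1, x2))
```

For the two sample points, the Manhattan distance is 7 and the Euclidean distance is 5, which shows how the choice of p changes the measured separation between the same pair of points.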


The method proposed in [18] is capable of recursively updating its self-defined parameters using only
the current data sample while discarding all previous data samples, and it also evolves its structure
automatically depending on the experimentally observable streaming data. Figure 3 shows an example of the
process of forming new clusters.
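
The idea of updating cluster parameters recursively from the current sample alone can be sketched as follows. This Python example is a generic one-pass clusterer written for illustration; it is not the algorithm of [18]. In particular, the fixed `radius` threshold is a hypothetical user parameter, whereas the method of [18] is described as free of user-defined parameters.

```python
import math


class OnlineClusterer:
    """Toy one-pass clusterer: each cluster stores only a count and a mean."""

    def __init__(self, radius):
        self.radius = radius  # hypothetical fixed threshold, not from [18]
        self.centers = []     # running mean of each cluster
        self.counts = []      # number of samples absorbed by each cluster

    @staticmethod
    def _distance(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def update(self, x):
        """Process one sample and return the index of its cluster."""
        if self.centers:
            dists = [self._distance(x, c) for c in self.centers]
            nearest = min(range(len(dists)), key=dists.__getitem__)
            if dists[nearest] <= self.radius:
                self.counts[nearest] += 1
                k = self.counts[nearest]
                c = self.centers[nearest]
                # Recursive mean update: only the current sample is needed.
                self.centers[nearest] = [ci + (xi - ci) / k for ci, xi in zip(c, x)]
                return nearest
        # No cluster is close enough: the sample starts a new cluster.
        self.centers.append(list(x))
        self.counts.append(1)
        return len(self.centers) - 1


# Illustrative stream with two well-separated groups.
stream = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0), (0.1, 0.3)]
clusterer = OnlineClusterer(radius=1.0)
labels = [clusterer.update(x) for x in stream]
print(labels, clusterer.counts)
```

Each sample is seen once and then discarded; the cluster means are kept current through the update mean_k = mean_(k-1) + (x_k - mean_(k-1)) / k, and a new cluster is formed whenever a sample is far from all existing centers.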

As mentioned, one of the main problems in community detection is the large
amount of data generated by social networks. The integration process in
[18] is time-consuming for large data, and this makes the algorithm go out of real-time mode. With a proper
definition of the problem, an optimal selection of the fixed parameters, and the use of the crow search
algorithm, this algorithm can be customized for large data. The crow search algorithm [3] is the basic algorithm
of this research.


Detecting community on social networks with fast and optimal ... (Muneer Sameer Gheni Mansoor)
324 ISSN: 1693-6930


The crow search algorithm (CSA) is a population-based technique that works on the idea that
crows store their surplus food in secret places and retrieve it when food is needed [19]-[29]. According to this
algorithm, crows determine their new and updated positions in the search space. For this purpose, each crow i
randomly chooses one of the crows in the flock (for example, crow j) and follows it to discover the location of
the food hidden by this crow (m_j); this process is repeated for all crows. The new position of crow i is obtained
using (3). Various parameters can be optimally selected with the crow search algorithm. In this study, the value
of n is chosen intelligently and optimally.


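$$x^{i,itr+1} = \begin{cases} x^{i,itr} + r_i \times fl^{i,itr} \times \left(m^{j,itr} - x^{i,itr}\right), & r_j \geq AP^{j,itr} \\ \text{a random position}, & \text{otherwise} \end{cases} \qquad (3)$$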
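
The position update in (3) can be sketched in code. The Python example below is a minimal crow search for a single scalar parameter, written for illustration only. The default values N = 20, AP = 0.1, and fl = 2 follow Table 1, while the reduced iteration count, the search interval, the seed, and the quadratic `toy_cost` are assumptions made here; the actual cost function used to select n is not given in the text.

```python
import random


def crow_search(cost, lower, upper, n_crows=20, max_iter=200,
                awareness_prob=0.1, flight_length=2.0, seed=0):
    """Minimise a scalar cost over [lower, upper] with a basic crow search."""
    rng = random.Random(seed)
    positions = [rng.uniform(lower, upper) for _ in range(n_crows)]
    memory = list(positions)                 # best position each crow remembers
    memory_cost = [cost(m) for m in memory]

    for _ in range(max_iter):
        for i in range(n_crows):
            j = rng.randrange(n_crows)       # crow i follows a random crow j
            if rng.random() >= awareness_prob:
                # Crow j is unaware: move toward its hidden food m_j, as in (3).
                step = rng.random() * flight_length
                candidate = positions[i] + step * (memory[j] - positions[i])
            else:
                # Crow j is aware: crow i is sent to a random position.
                candidate = rng.uniform(lower, upper)
            if lower <= candidate <= upper:  # keep only feasible moves
                positions[i] = candidate
                c = cost(candidate)
                if c < memory_cost[i]:       # update memory on improvement
                    memory[i], memory_cost[i] = candidate, c

    best = min(range(n_crows), key=memory_cost.__getitem__)
    return memory[best], memory_cost[best]


def toy_cost(value):
    """Hypothetical stand-in cost with its minimum at value = 3."""
    return (value - 3.0) ** 2


best_value, best_cost = crow_search(toy_cost, lower=1.0, upper=4.0)
print(best_value, best_cost)
```

In a setting like the one described here, `toy_cost` would be replaced by a measure that evaluates the clustering obtained for a candidate value of the parameter, for example a combination of execution time and clustering quality; that choice is left open in this sketch.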


Figure 2. The clustering algorithm transforms the data stream into specific clusters


Figure 3. An example of the process of forming new clusters


4. SIMULATION OF THE PROPOSED METHOD 

MATLAB 2022 is used to simulate the proposed method. In order to measure the accuracy and speed
of the method, large random data sets with large dimensions have been generated. In our proposed method, the
parameters AP, fl, itr_max, and N are set to their default values. Table 1 shows the selected parameters of the
crow search algorithm.


Table 1. Selected parameters of the crow search algorithm
Parameter    Value
N            20
itr_max      5000
AP           0.1
fl           2


Figure 4 shows the clustering results for 20,000 random data points. The panel on the left shows the output
clusters of the algorithm without CSA. In this case, n is equal to 2 (default), the code execution
time is about 5.44 seconds, and the number of clusters is 6. The panel on the right shows the output clusters
of the algorithm with CSA. In this case, n is equal to 2.35, the code execution
time is about 2.51 seconds, and the number of clusters is 3.

One of the goals of this research is to handle growing data volumes. Figure 5 shows the clustering results for
100,000 random data points. The panel on the left shows the output clusters of the algorithm without CSA. In
this case, n is equal to 2 (default), the code execution time is about 17.82 seconds, and the number of clusters
is 3. The panel on the right shows the output clusters of the algorithm with CSA. In this case, n is equal to
2.15, the code execution time is about 7.51 seconds, and the number of clusters is 2.

It is clear from the results that a smart choice of n makes clustering more than twice as fast: about 2.2 times
faster for 20,000 data points and about 2.4 times faster for 100,000 data points. The proposed method keeps this
advantage as the amount of data increases, so the time is significantly reduced. In this case, the number of
outliers increases, but the accuracy of the proposed algorithm does not decrease. Therefore, the proposed method
is well suited to big data. The idea of this research can be expanded further in the future. One interesting
direction is to choose other parameters to optimize in order to find a better algorithm. These parameters can be
different distance criteria or smart thresholds. A summary of the results is presented in Table 2.


Figure 4. Results of the proposed method for 20,000 data, left: without n-optimization, right: with
n-optimization


Figure 5. Results of the proposed method for 100,000 data, left: without n-optimization, right: with
n-optimization


Table 2. Summary of results in two simulated cases
Method                       Time (s)   Number of clusters   n
[18] method (20,000)         5.44       6                    2
Proposed method (20,000)     2.51       3                    2.35
[18] method (100,000)        17.82      3                    2
Proposed method (100,000)    7.51       2                    2.15
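
The speedup implied by Table 2 can be recomputed directly from the reported execution times. The short Python sketch below does only this arithmetic; the dictionary layout and the names `baseline` and `proposed` are chosen here for illustration.

```python
# Execution times in seconds, copied from Table 2.
times = {
    20_000: {"baseline": 5.44, "proposed": 2.51},
    100_000: {"baseline": 17.82, "proposed": 7.51},
}


def speedup(baseline, proposed):
    """Ratio of the baseline time to the proposed method's time."""
    return baseline / proposed


speedups = {size: speedup(t["baseline"], t["proposed"]) for size, t in times.items()}
for size, ratio in speedups.items():
    print(f"{size} points: {ratio:.2f}x faster")
```

The ratios are about 2.17 for 20,000 data points and about 2.37 for 100,000 data points, so the gain is slightly larger on the larger data set.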
5. CONCLUSION

During the last decade, more and more attention has been paid to communication in modern society.
Today, social networks with hundreds of millions of members are considered a powerful tool to guide the flow
of information. Therefore, researchers have studied different aspects of these networks.
One of the important issues in social network analysis is community detection, for which clustering is used.
One of the main problems in community detection is the large amount of data generated from social networks.
Therefore, recent research has focused on developing algorithms that have low computational overhead and high
scalability. Our contribution in this research was to provide a fast and accurate algorithm for community
detection in social networks with large data volumes. The proposed algorithm is based on fast data stream
clustering. The crow search optimization algorithm was used to increase the speed of the proposed algorithm.
Such a design helps to create an intelligent algorithm with a low computational load. The simulation results
showed that the proposed algorithm works more than twice as fast (about 2.2 to 2.4 times in the simulated
cases). With further improvement in future research, this customized algorithm could also address the
clustering of sparse data.